[Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp#264
Merged
YWHyuk merged 3 commits intoJun 18, 2026
Conversation
…ases Three fixes from the max-effort review of this branch: - get_dma_info: after retiring the floor/mod recompile branches, a residual floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible radix) that axis-split/graph-copy did not linearize was silently bucketed by its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor. Raise NotImplementedError instead of mis-striding silently. No test triggers it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net. - decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of each reassociation group, which is a unit dim when trailing unit dims attach after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the wrong axis. Pick the non-unit dim in each group. - decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split axis was not among the inner 4 dims, conflating peeled-into-the-outer-loop (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine). Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle >4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
More fixes from the max-effort review, after verifying each against the C++ reference and reachability: - graph_copy _relayout_args: ranges picked the consumer iteration shape by rank alone (max key=len), so for two equal-rank operands with different per-dim extents the broadcast-from operand's smaller shape could win and the real incompatible-radix conflict on the broadcast-to dim was missed (order-dependent: a commutative reorder flipped correct relayout into a silent miss). Use per-dim max extent over the max-rank operands. - lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType -- F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm, as in the validated path) instead of being lowered to VCIX, and add the missing rank != 1 guard. - lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick one silently). Latent today (heuristic/autotune only emit SS-multiple tiles). - Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the two stale 'no-op unless set' comments. Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite, softmax, layernorm -- all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fix exp chunk Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but the gemm/bmm/conv tests do not: - _lower_matmul bailed with 'if ATag is None or BTag is None: return False', gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V matmul, operand B is the softmax output produced in place by affine.vector_store, not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by either a dma_start OR a preceding affine.vector_store into its root memref; bail only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++). - _make_sf_vc_v_iv n>1 transcendental chunking called vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order, missing the result type and vector operand, raising TypeError under these MLIR bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile (n==1) tests never hit it. Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong); matmul/bmm/conv2d regress clean. Bisected: C++ vcix passes SDPA, Python vcix did not; exp chunking and fine-grained ruled out separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes from the max-effort code review of #259, each verified against the C++ reference and reachability. Squashed/folded with the former #265 (SDPA vcix fix).
Correctness
Hardening / parity / docs
Refuted (no change): dma_fine_grained bias tag map (Python writer/wait mutually consistent), build_tog loop-stride dedup (no duplicate keys emitted).
Validated end-to-end (Spike+TOGSim allclose): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite (incl. pixel_shuffle), reduce, softmax, layernorm, batchnorm, gqa, SDPA (was crash/wrong) -- all pass. SDPA bisected: C++ vcix passes, Python vcix did not; localized to _lower_matmul.